Contents

  • Reviews
  • The Log-Likelihood Gradient
  • Stochastic Maximum Likelihood and Contrastive Divergence
  • Pseudolikelihood
  • Score Matching and Ratio Matching
  • Denoising Score Matching
  • Noise-Contrastive Estimation
  • Estimating the Partition Function
    • Annealed Importance Sampling
    • Bridge Sampling

Reviews

  • Probability vs Compatibility
    • $p(x) = \cfrac{1}{Z}\tilde{p}(x)$
    • where $Z = \int \tilde{p}(x) \, dx$ (a sum over $x$ in the discrete case)
    • $Z$ is the partition function (computed by brute force in the sketch after this list)
  • Energy-based model

    • $\tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))$
      • ensures $\tilde{p}(\mathbf{x}) > 0$ for all $\mathbf{x}$
      • $E(\mathbf{x})$ is the energy function
      • $\tilde{p}(\mathbf{x})$ in EBM is an example of a Boltzmann distribution.
  • MCMC

    • We can sample from the system's steady state (its stationary distribution) even from a random initialization
    • Used to approximate intractable integrals (expectations over the model)
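A minimal sketch to make the review concrete, assuming a made-up energy function over a tiny discrete state space so that $Z$ can be computed by brute force:

```python
import numpy as np

# Tiny energy-based model: E(x) is an arbitrary illustrative choice.
states = np.array([0.0, 1.0, 2.0, 3.0])   # the entire (discrete) state space

def energy(x):
    return (x - 1.5) ** 2                 # any real-valued E(x) works

p_tilde = np.exp(-energy(states))         # compatibility: always > 0
Z = p_tilde.sum()                         # partition function (a sum, since x is discrete)
p = p_tilde / Z                           # normalized Boltzmann distribution

print(p, p.sum())                         # p.sum() == 1.0
```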

18.1 Log-likelihood Gradient

  • Gradient decomposes into two terms: $\nabla_\theta \log p(x) = \nabla_\theta \log \tilde{p}(x) - \nabla_\theta \log Z$
  • Partition $Z$ is what makes learning hard: $\nabla_\theta \log Z = \mathbb{E}_{x\sim p(x)}[\nabla_\theta \log \tilde{p}(x)]$, an expectation under the model itself (see the numerical check below)
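A quick numerical check of the identity above, using a made-up exponential-family model $\tilde{p}(x) = \exp(\theta x)$ over four states: the finite-difference gradient of $\log Z$ matches the model expectation of $\nabla_\theta \log \tilde{p}(x) = x$.

```python
import numpy as np

states = np.arange(4.0)                   # x in {0, 1, 2, 3}

def log_Z(theta):
    return np.log(np.exp(theta * states).sum())

theta, eps = 0.7, 1e-5
grad_fd = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)  # finite differences

p = np.exp(theta * states)
p /= p.sum()
grad_expect = (p * states).sum()          # E_{x~p}[grad_theta log p_tilde(x)] = E_p[x]

print(grad_fd, grad_expect)               # agree to ~1e-9
```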

18.2 Stochastic Maximum Likelihood and Contrastive Divergence

  • How to use gibbs_update (MCMC that resamples a subset of the variables conditioned on the rest)
  • How to make it Faster
  • Positive Phase vs Negative Phase
    • $\nabla_\theta \log p(x)=\nabla_\theta \log \tilde{p}(x) - \mathbb{E}_{x\sim p(x)} [\nabla_\theta \log \tilde{p}(x)]$
      • Positive: $\nabla_\theta \log \tilde{p}(x)$
      • Negative: $\mathbb{E}_{x\sim p(x)} [\nabla_\theta \log \tilde{p}(x)]$
    • Practical Calculation
      • Gradient update $= \sum_{x\sim \text{data}} \nabla_\theta \log \tilde{p}(x) - \sum_{x\sim \text{MCMC}} \nabla_\theta \log \tilde{p}(x)$

Naive Algorithm
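A runnable sketch of the naive approach, under assumptions not in the original notes: a toy model with independent binary units ($E(\mathbf{x}) = -\theta^\top \mathbf{x}$, so $\nabla_\theta \log \tilde{p}(\mathbf{x}) = \mathbf{x}$), synthetic data, and a hand-picked learning rate. Every gradient step restarts the chains at random and burns in with many gibbs_update sweeps:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, lr, k_burn_in = 5, 64, 0.1, 100
data = (rng.random((1000, D)) < 0.8).astype(float)  # synthetic training set
theta = np.zeros(D)

def gibbs_update(x, theta):
    """One full Gibbs sweep: resample each unit from p(x_i = 1 | x_-i)."""
    for i in range(x.shape[1]):
        p_on = 1.0 / (1.0 + np.exp(-theta[i]))      # sigmoid(theta_i) for this toy model
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(float)
    return x

for step in range(200):
    batch = data[rng.integers(0, len(data), m)]
    x = (rng.random((m, D)) < 0.5).astype(float)    # fresh random initialization...
    for _ in range(k_burn_in):                      # ...then an expensive burn-in every step
        x = gibbs_update(x, theta)
    theta += lr * (batch.mean(0) - x.mean(0))       # positive phase - negative phase

print(1 / (1 + np.exp(-theta)))                     # learned p(x_i = 1), near 0.8
```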

Contrastive Divergence Algorithm (CD)

  • Motivation
    • Burning in the Gibbs chain from a random initialization (k ≈ 100 gibbs_update steps) at every gradient step is too expensive computationally.
    • CD initializes the chain at the data instead, reducing k to 1-20 for a model of a small image patch (see the sketch below)
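CD-k on the same toy model as the naive sketch above; the only changes are the chain initialization (at the data batch) and the much smaller k:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, lr, k = 5, 64, 0.1, 5                         # k is small now
data = (rng.random((1000, D)) < 0.8).astype(float)
theta = np.zeros(D)

def gibbs_update(x, theta):
    for i in range(x.shape[1]):
        p_on = 1.0 / (1.0 + np.exp(-theta[i]))
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(float)
    return x

for step in range(200):
    batch = data[rng.integers(0, len(data), m)]
    x = batch.copy()                  # key CD idea: start the chain at the data
    for _ in range(k):                # only a few cheap sweeps
        x = gibbs_update(x, theta)
    theta += lr * (batch.mean(0) - x.mean(0))

print(1 / (1 + np.exp(-theta)))
```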

Stochastic Maximum Likelihood Algorithm (SML)

  • Also called Persistent Contrastive Divergence (PCD)
  • Reduces k to 1 for a model of a small image patch
  • How

    • Maintain the contrastive (negative-phase) samples persistently across gradient updates!!! (So simple; see the sketch below)
  • There is a further refinement that accelerates SML, called fast PCD (FPCD).
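SML / PCD on the same toy model: the negative-phase chains are created once and never re-initialized, so a single gibbs_update sweep per gradient step suffices while the parameters move slowly:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, lr = 5, 64, 0.1
data = (rng.random((1000, D)) < 0.8).astype(float)
theta = np.zeros(D)
persistent = (rng.random((m, D)) < 0.5).astype(float)  # chains live across updates

def gibbs_update(x, theta):
    for i in range(x.shape[1]):
        p_on = 1.0 / (1.0 + np.exp(-theta[i]))
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(float)
    return x

for step in range(200):
    batch = data[rng.integers(0, len(data), m)]
    persistent = gibbs_update(persistent, theta)       # k = 1, no re-initialization
    theta += lr * (batch.mean(0) - persistent.mean(0))

print(1 / (1 + np.exp(-theta)))
```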

18.3 Pseudolikelihood

  • Set $\log p(\vec{x}) \approx \sum_d \log p(x_d|\vec{x}_{-d})$
  • Use the conditional probability property ($Z$ cancels between numerator and denominator)
    • $p(x_1|x_2)=\cfrac{p(x_1,x_2)}{p(x_2)}=\cfrac{\tilde{p}(x_1,x_2)}{\tilde{p}(x_2)}=\cfrac{\tilde{p}(x_1,x_2)}{\int \tilde{p}(x_1,x_2) \, dx_1}$
  • e.g.
    • Suppose we want to calculate $p([1,1,1,1])$ where each $x_d$ takes a state in $\{1, 2, 3\}$ ($K=3$, $D=4$).
      • We have to evaluate $p(x_1=1|x_2=1, x_3=1, x_4=1)$
        • By evaluating the unnormalized model $\tilde{p}$ at every value of $x_1$
          • $\tilde{p}(x_1=1, x_2=1, x_3=1, x_4=1)$
          • $\tilde{p}(x_1=2, x_2=1, x_3=1, x_4=1)$
          • $\tilde{p}(x_1=3, x_2=1, x_3=1, x_4=1)$
        • $p(x_1=1|x_2=1, x_3=1, x_4=1) = \cfrac{\tilde{p}(x_1=1, x_2=1, x_3=1, x_4=1)}{\sum_k \tilde{p}(x_1=k, x_2=1, x_3=1, x_4=1)}$
      • Iterate this for all $D$ dimensions
      • Finally
        • $p([1,1,1,1]) \approx p(x_1=1|[1,1,1]) \times p(x_2=1|[1,1,1]) \times p(x_3=1|[1,1,1]) \times p(x_4=1|[1,1,1])$
    • Cost: $K^D$ evaluations (summing out $Z$ exactly) $\to$ $K \times D$ evaluations of $\tilde{p}$ (see the sketch at the end of this section)
  • This may look like an unprincipled hack, but it can be proven that estimation by maximizing the pseudolikelihood is asymptotically consistent
  • "Generalized" Pseudolikelihood

    • Partition the $D$ dimensions into $m$ sets
      • If $m=1$
        • No partition
        • Exact likelihood
      • If $m=D$
        • One partition per individual dimension
        • Same as Pseudolikelihood
      • If $1<m<D$
        • $\log p(\vec{x}) \approx \sum_{i=1}^{m} \log p(\vec{x}^{(i)}|\vec{x}^{(-i)})$, where $\vec{x}^{(i)}$ is the $i$-th partition
  • It can perform better than maximum likelihood for "tasks that require only the conditional distributions used during training"

    • such as filling in small amounts of missing values.
  • Generalized pseudolikelihood techniques are especially powerful
    • If designed to capture the most important correlations
      • For example, in natural images, widely separated pixels have weak correlation, so the partitions can group nearby, strongly correlated pixels together
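A sketch of the $K \times D$ computation from the example above, assuming an arbitrary made-up $\log \tilde{p}$ (a chain of pairwise couplings) and 0-indexed states:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                                  # K states per dimension, D dimensions
coupling = rng.normal(size=(K, K))           # made-up pairwise parameters

def log_p_tilde(x):
    """Unnormalized log-probability of a full configuration x (length D)."""
    return sum(coupling[x[d], x[d + 1]] for d in range(D - 1))

def log_pseudolikelihood(x):
    total = 0.0
    for d in range(D):                       # one conditional per dimension ...
        scores = np.empty(K)
        for k in range(K):                   # ... each needs K evaluations of p_tilde
            x_k = x.copy()
            x_k[d] = k
            scores[k] = log_p_tilde(x_k)
        # log p(x_d | x_-d) = log p_tilde(x) - log sum_k p_tilde(x with x_d = k)
        total += scores[x[d]] - np.log(np.exp(scores).sum())
    return total

x = np.array([0, 0, 0, 0])                   # the [1,1,1,1] example, 0-indexed
print(log_pseudolikelihood(x))               # K*D = 12 evaluations, not K**D = 81
```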

18.4 Score Matching and Ratio Matching

Score Matching

  • Like pseudolikelihood, score matching sidesteps the partition function $Z$: the derivative is taken with respect to $x$, not $\theta$, so $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$ and the $Z$ term drops out
  • Score
    • $\nabla_x \log p(x)$
  • Approximation: minimize the expected squared difference between the model score and the data score; integration by parts turns this into a tractable objective involving only the model, $J(\theta) = \mathbb{E}_{x \sim p_{data}}\left[\sum_j \left(\partial_j^2 \log \tilde{p}(x) + \frac{1}{2}\left(\partial_j \log \tilde{p}(x)\right)^2\right)\right]$ (see the sketch below)
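A sketch of that tractable objective for a 1-D Gaussian model family (chosen because its score has a closed form; the data and grid are made up). Minimizing $J$ recovers the data mean and standard deviation without ever touching $Z$:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

def J(mu, sigma):
    """Tractable score-matching objective for log p_tilde(x) = -(x-mu)^2 / (2 sigma^2)."""
    d1 = -(data - mu) / sigma**2      # d/dx log p_tilde(x): the model score
    d2 = -1.0 / sigma**2              # d^2/dx^2 log p_tilde(x)
    return np.mean(d2 + 0.5 * d1**2)

# Brute-force minimization over a grid of (mu, sigma).
mus, sigmas = np.linspace(0, 4, 81), np.linspace(0.5, 3, 51)
vals = np.array([[J(m_, s_) for s_ in sigmas] for m_ in mus])
i, j = np.unravel_index(vals.argmin(), vals.shape)
print(mus[i], sigmas[j])              # close to the true (2.0, 1.5)
```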

Ratio Matching

  • A more successful approach to extending the basic ideas of score matching to discrete data
  • Ratio? Work with the ratios $\tilde{p}(\mathbf{x}) / \tilde{p}(f(\mathbf{x}, j))$, where $f(\mathbf{x}, j)$ flips bit $j$ of the binary vector $\mathbf{x}$, so the partition function cancels; the objective is $L^{(RM)}(\mathbf{x}) = \sum_j \left(\cfrac{1}{1 + \frac{\tilde{p}(\mathbf{x})}{\tilde{p}(f(\mathbf{x},j))}}\right)^2$ (see the sketch below)
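A sketch of that objective for binary data, assuming an arbitrary made-up unnormalized model; note that $Z$ cancels inside the ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6
W = rng.normal(size=(D, D))
W = (W + W.T) / 2                     # symmetric couplings (an arbitrary choice)

def log_p_tilde(x):
    return x @ W @ x                  # any unnormalized log-probability of binary x

def ratio_matching_loss(x):
    loss = 0.0
    for j in range(D):                # one term per bit
        x_flip = x.copy()
        x_flip[j] = 1 - x_flip[j]     # f(x, j): flip bit j
        ratio = np.exp(log_p_tilde(x) - log_p_tilde(x_flip))  # Z cancels here
        loss += (1.0 / (1.0 + ratio)) ** 2
    return loss

x = (rng.random(D) < 0.5).astype(float)
print(ratio_matching_loss(x))
```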
